In this paper, memory optimization and architectural-level modifications are
introduced to realize a low-power residue number system (RNS) with
improved flexibility for electroencephalograph (EEG) signal classification.
The proposed RNS framework is intended to maximize the reconfigurability
of RNS for high-performance finite impulse response (FIR) filter design. The
existing power-hungry RAM-based reverse conversion model is replaced
with a highly decomposed lookup table (LUT) model that produces the
results without any post-accumulation process. The reverse conversion
block is modified with an appropriate functional unit to accommodate FIR
convolution results. The proposed approach develops and executes
pre-calculated inverses for various moduli sets. Therefore, the
proposed LUT decomposition with RNS multiplication-based
post-accumulation provides a high-performance FIR filter architecture
that allows different frequency response configurations.
Experimental results show the superior performance of decomposed
LUT-based direct reverse conversion over other existing reverse conversion
techniques adopted for energy-efficient RNS FIR implementations.
Compared with the conventional RNS FIR design, the proposed FSM-based
decomposed RNS FIR reduces the logic elements (LEs) by 4.57%, increases
the maximum frequency by 31.79%, reduces the number of LUTs by 42.85%,
and reduces the power dissipation by 13.83%.

1. INTRODUCTION


Electroencephalograph (EEG)-based signal measurements are investigated not only for the
brain-computer interface (BCI) but also as a diagnostic channel for many brain-related problems during
clinical measurements [1]-[2]. The signal activity measured by non-invasive EEG recordings from the scalp is
contaminated by other sources and artifacts [3]. To overcome this contamination during EEG signal
classification, several signal processing and feature extraction methods have been investigated for both
high-performance BCI and diagnostic measurements. Feature extraction includes spatial filtering to generate
different forms of spatial patterns and incorporates covariance analysis to maximize the class differences in
spatial scale. However, the rhythmic activities of EEG signals are highly correlated with associated frequency
bands [4]. To account for this, raw EEG signals are pre-processed with narrow-band spectral filters just
before the spatial filtering [5]. Manually selecting the most prominent frequency band for each class,
together with the associated optimization models, leads to a computational burden. In recent years, the
investigation of spatial and spectral filters has steadily emerged for reducing the false rate. The time-domain
information of EEG signals alone is not sufficient for classification and evaluation. In real time, EEG signals
are evaluated in different forms by individuals for clinical diagnosis [6].

In general, the analysis of EEG signals starts with the formulation of spatial weights obtained from
the electrodes, known as a common spatial pattern (CSP). For signal classification, neural networks are most
commonly preferred, as they provide a better framework for characterizing these spatial patterns. To process
the weights using a multilayer perceptron (MLP), all static weights can be replaced with finite impulse response
(FIR) filters, as an extension of existing neural networks, to accumulate the frequency bands that relate to the
most appropriate signal measures. Higashi and Tanaka [7] developed a discriminative filter bank using FIR
filter design by optimizing an objective function that is a natural extension of CSP. Meng et al. [8] utilized both
spatial and spectral features, and the learning task is formulated accordingly to maximize the discrimination
among different class labels. By estimating the parametric distribution of these spatial-spectral features, the
mutual information (MI) is derived and the cost function is optimized during the iterative learning process.

As opposed to optimizing the cost function of spatio-temporal features, several models have been
proposed [9] that optimize the filters for cost-effective signal measurement and analysis. On the other hand,
the success of common spatial pattern (CSP) analysis depends directly on the order of the FIR filter.
The core objective of this paper is to develop a new optimized FIR core for spatio-temporal filtering. For better
discriminability, FIR temporal filters require improved flexibility and a higher filter order. In this
paper, a high-performance FIR filter design using a residue number system (RNS) is proposed that
combines the benefits of parallel processing, complexity reduction, and energy efficiency. In this
context, the computationally intensive processing blocks of the FIR filter are optimized using
RNS arithmetic.


2. RESEARCH METHOD 

Several attempts have been made [10]-[11] to optimize the accumulator and multiplication units so
that the complete system requirements of filter design are met with appropriate hardware units. In existing
works, the methodologies developed for RNS computation fall broadly into two types: LUT-based models and
conventional binary modules. A lookup-table-based RNS offers improved system performance for
small moduli sets, whereas a binarized RNS model performs better for large moduli. In
most cases, the accuracy of the FIR filter depends largely on the number of
FIR coefficients and the associated precision levels.

However, LUT-based reverse computation for RNS arithmetic has a large computational cost and
takes a long time. A binary-coded structure for calculating residues and a thermometer-coded
style for generating modular inner products were proposed in [12]. When designing FIR filters, this distributed
arithmetic avoids carry propagation during accumulation and uses pre-computed LUT blocks in order to maximize
operating speed and minimize hardware complexity [13]. A high-performance Booth multiplier is incorporated
into the RNS accumulator for FIR filter design flexibility and low complexity. Because it relies on a
redundant residue number system [14], the low-cost fault-tolerant FIR filter does not require any additional
hardware. A proper down-convert moduli set is used for FIR calculations to eliminate faults generated in
MAC computations by single event upsets (SEU) [15]. A binary-to-residue converter with fewer modular
multiplications was presented in order to reduce hardware complexity and power consumption. In this
technique, a pre-loaded product block reduces the computational cost and latency of generating partial
products for each FIR tap. As discussed in [16], end-around carry (EAC) units eliminate the performance
trade-off inherent in any RNS FIR filter design that includes additional taps. Touil et al. [17] carried out FIR
filter design optimization using non-recursive filtering algorithms and an appropriate mathematical model.
A structural implementation was developed for a modular digital filter by using residue ring measures and
the symmetric characteristics of FIR filter responses. Compared with the existing modular
filter, that FIR design reduces the impulse response by half of the FIR length. Here, the hardware
complexity is considerably reduced using finite field algebra, and the accuracy of computation is also
improved in modular arithmetic-based digital FIR filter implementations.

The performance of a hardware implementation of the RNS depends on the
reverse conversion process, which is complex in nature. To mitigate this problem, the diagonal
function (DF) is introduced in [18] for RNS construction, which allows efficient hardware implementations for
moduli of the form 2^n, 2^n-1, or 2^n+1. Different approaches are constructed using the DF to accommodate
different forms of magnitude comparison and the reverse conversion process. Jaberipur et al. [19] introduced
diminished-1 (D1) encoding in the RNS and investigated its effect on modulo-(2^n+1) addition, subtraction,
unified add/sub units, and multiplication. The impact of the D1 representation in RNS is validated
by implementing finite impulse response filters and discrete cosine transform applications.

Convolutional neural networks (CNNs) are widely used in many pattern recognition systems, but they
require a large amount of memory to hold weights during the learning process. To reduce the hardware cost of
CNN implementation, the residue number system (RNS) is used in each layer of the convolutional neural
network in [20]. The hardware implementation of the RNS-based CNN showed that the use of residue arithmetic
saves 7.86%-37.78% of hardware resources compared with the conventional two's complement method. The
RNS CNN also reduces the overall recognition time by 41.17%. Valueva et al. [21] decomposed the CNN
architecture into hardware and software parts to maximize the system performance of the RNS hardware
components. The inclusion of software parts offers significant memory efficiency during the learning process.
Cardarilli et al. [22] analyzed the characteristics of RNS and the conventional two's complement system (TCS)
in the different stages of DSP applications and introduced a unique design space exploration (DSE)
methodology for high-performance digital FIR filter implementation. This DSE offers energy efficiency for
several emerging applications such as machine learning and the internet of things.

Vinitha and Sharma [23] proposed a DA-based FIR implementation using an efficient lookup table
(LUT) design. By utilizing the even multiple storage (EMS) scheme, the size of the LUT-based multiplier
is halved, which reduces the path delay and the computational complexity overhead in FIR
design. In [24], a memory-efficient ROM-free reverse converter design is proposed for different sizes of
moduli sets to perform high-speed arithmetic with highly balanced moduli and appropriate adder units. The
memory-less RNS offers significant energy efficiency, with some notable path delay accumulation due to
dynamic post-computation.

Pontarelli et al. [25] presented a comparative analysis of FIR filter design using the RNS against
other well-known models that use the conventional positional number system. The RNS-based
FPGA hardware implementation shows that the frequency of FIR filters is increased by about 4 times, and the
computational complexity is reduced by 3 times, when the RNS is incorporated for FIR computation
as compared to the traditional binary number system. In addition to the path delay and area efficiency, the
RNS-based FIR design also offers energy savings of up to 23%. In an RNS, inter-modulo computation
consumes the most resources, which is directly associated with the complex reverse conversion process.

NavaeiLavasani et al. [26] used the mixed-radix conversion (MRC) algorithm to implement an RNS
reverse converter design for ternary computation with the moduli set {3^n-2, 3^n-1, 3^n}. The integration of the
RNS in ternary DSP applications offers effective number representation and maximizes the inherent
parallelism for a high throughput rate. For hardware-efficient RNS implementation in applications such as edge
computing and FIR filter design, these reverse converters need to be embedded as a simplified arithmetic
unit for each class of moduli sets. Effective hardware resource utilization [26] also saves a considerable
amount of area and power in the hardware realization of the RNS, with a negligible penalty
in path delay. The experimental results show the superiority of reverse converter optimization using different
design methodologies.
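
The mixed-radix conversion mentioned above can be sketched generically as follows. This is a minimal
illustration of the textbook MRC recurrence, not the ternary converter of [26]. The function names
mixed_radix_digits and from_mixed_radix are hypothetical, and the moduli set (7, 8, 9) is chosen only
because it equals {3^n-2, 3^n-1, 3^n} for n = 2. Python 3.8 or later is assumed for the modular
inverse pow(a, -1, m).

```python
def mixed_radix_digits(residues, moduli):
    """Return digits a_i such that x = a_0 + a_1*m_0 + a_2*m_0*m_1 + ..."""
    r = list(residues)
    digits = []
    for i, mi in enumerate(moduli):
        a = r[i] % mi
        digits.append(a)
        # Subtract the digit in every later channel, then divide by m_i there
        # by multiplying with the inverse of m_i modulo m_j.
        for j in range(i + 1, len(moduli)):
            mj = moduli[j]
            r[j] = ((r[j] - a) * pow(mi, -1, mj)) % mj
    return digits


def from_mixed_radix(digits, moduli):
    """Weighted sum of the mixed-radix digits (no modular reduction needed)."""
    x, weight = 0, 1
    for a, m in zip(digits, moduli):
        x += a * weight
        weight *= m
    return x


MODULI = (7, 8, 9)  # illustrative, pairwise coprime
x = 200
residues = [x % m for m in MODULI]
digits = mixed_radix_digits(residues, MODULI)
recovered = from_mixed_radix(digits, MODULI)
```

Unlike a CRT-based converter, the final weighted sum here never exceeds the dynamic range, so no
reduction modulo M is required; the price is the sequential dependency between digits.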

A systolic array architecture [27] uses less hardware and shortens the critical path, allowing for faster
processing times. Because of the use of parallel processing and pipelining, the overall chip size and power
consumption are reduced significantly. Optimizing hardware resources for higher-order filters remains
a major difficulty.


3. PROPOSED RNS SYSTEM IN FIR FILTER DESIGN 

For the FIR filter design shown in Figure 1, the input signal samples and filter coefficients
take real values, including negative numbers. In the RNS, only non-negative integers are used
for arithmetic computation, and the dynamic range is constrained to [0, M-1]. To
accommodate negative numbers in an RNS-based FIR filter implementation, number
conversions are applied to the binary representation.
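
One common way to accommodate negative numbers, shown here only as an assumed illustration since the
text does not specify the conversion used, is to split [0, M-1] into a non-negative half and a negative
half. The helper names encode_signed and decode_signed and the moduli set (7, 8, 9) are hypothetical
choices for this sketch.

```python
MODULI = (7, 8, 9)   # illustrative pairwise coprime set
M = 7 * 8 * 9        # dynamic range, M = 504
HALF = M // 2


def encode_signed(x: int) -> int:
    """Map a signed integer in [-M/2, M/2 - 1] onto [0, M-1]."""
    if not -HALF <= x < HALF:
        raise ValueError("value outside the signed dynamic range")
    return x % M     # negative values wrap to the upper half


def decode_signed(u: int) -> int:
    """Inverse mapping: the upper half of [0, M-1] represents negatives."""
    return u - M if u >= HALF else u


# A signed product computed entirely on the non-negative representatives.
product = decode_signed((encode_signed(-3) * encode_signed(4)) % M)
```

With this mapping the usable signed range is half of M in each direction, which is why the moduli
set has to be sized for twice the largest expected magnitude.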

In recent years, DSP applications have dealt with high-bit-size samples for enhanced precision,
which necessitates a large amount of hardware resources for computation. The word length is always a
trade-off, regardless of the arithmetic model employed for data calculation. To maintain performance
in terms of speed, arithmetic based on modular fields must be adopted.

Figure 1. FIR filter design architecture using multipliers and adders


3.1. Residue number system 

In many real-time applications, RNS proves to be a promising alternative to traditional two's
complement arithmetic owing to its inherent parallel computation and parameters that are predetermined
by the moduli-set formulation. However, a fully automated RNS digital system
implementation is still not possible because of its complex post-processing steps, which require
pre-computation during RNS arithmetic for real-time applications and storage of the results in memory.
In addition, the modular and distributed nature of RNS provides further performance benefits, and RNS is
widely used in fields such as cloud computing, wireless communication systems, and DSP applications [28].
Its tolerance of soft errors and its power efficiency make RNS-based data computing a prominent choice for
optimizing digital designs. With RNS-based arithmetic, the achievable throughput depends on the degree of
computational parallelism and the decomposition levels used during hardware implementation.
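
The channel-wise parallelism described above can be illustrated with a short sketch. It assumes the
moduli set (7, 8, 9) used later in Table 1; the function names to_residues, rns_add, and rns_mul are
hypothetical. Each output residue depends only on the operands of its own channel, so the channels
could run concurrently in hardware.

```python
from math import gcd
from itertools import combinations

MODULI = (7, 8, 9)

# The moduli must be pairwise coprime for the representation to be unique.
assert all(gcd(a, b) == 1 for a, b in combinations(MODULI, 2))


def to_residues(x, moduli):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli):
    """Channel-wise modular addition, no carry between channels."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))


def rns_mul(a, b, moduli):
    """Channel-wise modular multiplication."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))


a = to_residues(13, MODULI)
b = to_residues(17, MODULI)
total = rns_add(a, b, MODULI)     # residues of 30
product = rns_mul(a, b, MODULI)   # residues of 221, valid since 221 < 504
```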


3.2. Modulo mi multiplier 

In an RNS-based system, the selected moduli are constants, and each residue remains below its
modulus after residue computation. Residue computation uses integer arithmetic, which is performed in
parallel across the channels. The efficiency of this integer arithmetic therefore also influences the overall
RNS system performance. The advantages of the RNS are high-performance computation, energy efficiency,
and reduced hardware complexity.

As shown in Figure 2, both the input sample values (xn) and the FIR coefficients (hn) are converted into
residues using the moduli conversion block, and the final FIR convolution result (Yn) is generated using the
reverse conversion block. Dynamic range overflow is addressed by adopting different moduli sizes and
different sets of moduli. The required dynamic range is pre-determined and the
moduli are selected accordingly; otherwise, the RNS will produce erroneous results. To accomplish this,
statistical measures are required to pre-determine the bit sizes of the moduli and the number of moduli so
that all possible ranges of tap results during FIR computation can be accommodated.
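
As a rough illustration of why the moduli set must be sized to the filter, the sketch below uses a
simple worst-case bound (largest input magnitude times the sum of coefficient magnitudes) in place of
the statistical measures mentioned above, which the text does not detail. The function names and the
example coefficients are hypothetical.

```python
from math import prod


def worst_case_output(max_abs_x, coeffs):
    """Upper bound on |y| for y = sum(h_k * x_(n-k)) with |x| <= max_abs_x."""
    return max_abs_x * sum(abs(h) for h in coeffs)


def moduli_cover(moduli, max_abs_x, coeffs):
    """Signed results need a dynamic range M larger than twice the bound."""
    return prod(moduli) > 2 * worst_case_output(max_abs_x, coeffs)


def fir_output_residues(window, coeffs, moduli):
    """One FIR output, accumulated independently in every residue channel."""
    acc = [0] * len(moduli)
    for x, h in zip(window, coeffs):
        for i, m in enumerate(moduli):
            acc[i] = (acc[i] + (x % m) * (h % m)) % m
    return tuple(acc)


coeffs = [1, -2, 3, -2]      # illustrative 4-tap filter
window = [3, -1, 4, 2]       # samples aligned with the taps
small_ok = moduli_cover((7, 8, 9), 15, coeffs)       # bound 120, M = 504
small_fail = moduli_cover((7, 8, 9), 127, coeffs)    # bound 1016, M = 504
large_ok = moduli_cover((31, 32, 33), 127, coeffs)   # bound 1016, M = 32736
y_res = fir_output_residues(window, coeffs, (7, 8, 9))
```

When the check fails, the residues still compute without complaint, but the reverse-converted value
wraps around modulo M, which is the erroneous result the paragraph warns about.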


[Figure 2 block diagram: inputs xn and hn, output Yn]

Figure 2. Decomposed LUT based as a function of applied field


3.3. Reverse converter

As per the Chinese remainder theorem (CRT), the reverse converter module requires some unified
computations to convert the resultant residues into an actual integer. Although the direct digital
implementation of the reverse converter is difficult because of its iterative division units and
numerous multiplication units, many efficient models have been developed to generate the results. In most
of them, memory-based pre-computation is used for the reverse converter design. The elimination of
inter-modulus carry propagation and the residue-level multi-channel arithmetic in RNS significantly
influence the system performance in intensive DSP applications. There is no carry propagation between
residue channels, and the residue computation is performed concurrently in each channel, which
shortens the critical path delay.
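
A generic CRT reverse conversion with one pre-computed table per residue channel can be sketched as
follows. This is an assumed textbook formulation meant only to show how memory-based pre-computation
and table decomposition fit together. It is not the FSM-based architecture proposed in this paper, and
it still performs a final modular addition. The names build_channel_luts and reverse_convert are
hypothetical, and Python 3.8 or later is assumed for pow(a, -1, m).

```python
from math import prod


def build_channel_luts(moduli):
    """Pre-compute, per channel, the CRT term (r * M_i * inv_i) mod M for every residue r."""
    M = prod(moduli)
    luts = []
    for m in moduli:
        Mi = M // m
        inv = pow(Mi, -1, m)          # multiplicative inverse of M_i modulo m_i
        luts.append([(r * Mi * inv) % M for r in range(m)])
    return luts, M


def reverse_convert(residues, luts, M):
    """One lookup per channel followed by a single modular sum."""
    return sum(lut[r] for lut, r in zip(luts, residues)) % M


MODULI = (7, 8, 9)
LUTS, M = build_channel_luts(MODULI)
value = reverse_convert((6, 5, 4), LUTS, M)   # residues of 13
```

Each table has only as many entries as its modulus, so the memory cost grows with the sum of the
moduli rather than with their product.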


4. EXPERIMENTAL RESULTS 
4.1. Performance analysis

The FSM-based RNS FIR design is evaluated over different numbers of FIR taps
to examine the trade-offs. As shown in Tables 1 and 2, the proposed LUT-decomposition-driven,
DA-based RNS FIR model outperforms the other state-of-the-art methods considered in terms of
achievable throughput, since it benefits from both parallel computation and a reduced
critical path during reverse computation. The suggested RNS system takes advantage of the inherent concurrency
within residue channels as well as FPGA device capabilities. In Table 1, for the 8-bit word length with moduli
set (7, 8, 9), the area is decreased by 4.5%, whereas for the 16-bit word length with moduli set (31, 32, 33), the
area is decreased by 15.36%. For the 8-bit word length, the maximum frequency is increased by
24.08%, whereas for the 16-bit word length it is improved by 16.18%. In
Table 2, for the 4-tap FIR filter, the area of the proposed method is decreased by 4.15% and the maximum
frequency is increased by 13.30%, whereas for the 16-tap FIR filter the area is decreased by 4.98% and the
frequency is increased by 19.57% when compared with the conventional reverse computation method.


Table 1. Comparison of performance trade-offs based on input word length

Input word   Moduli set             RNS with conventional reverse    RNS with LUT-decomposed reverse
length       (2^n-1, 2^n, 2^n+1)    computation [29]                 computation (proposed method)
                                    Area (LEs)    Fmax (MHz)         Area (LEs)    Fmax (MHz)
8 bit        (7, 8, 9)              4281          57.3               4088          75.48
16 bit       (31, 32, 33)           14374         24.96              12165         29.78


Table 2. Analysis of the performance of the LUT-decomposed RNS FIR filter

FIR length   RNS with conventional reverse    RNS with LUT-decomposed reverse
             computation [29]                 computation (proposed method)
             Area (LEs)    Fmax (MHz)         Area (LEs)    Fmax (MHz)
4 tap        2096          63.46              2009          73.2
16 tap       8623          57.3               8193          71.25


4.2. Trade-off analysis

Figure 1 illustrates an architecture that is compatible with all potential dynamic word length
variations. However, each RAM is resized based on its hierarchical moduli information, and
linearization is performed over aspect ratios to maximize the operand bit size. The aspect ratios are derived
analytically using the reverse conversion model outlined in the previous section. The aspect ratios of
certain moduli sets are chosen to be close to the values of the moduli.

As demonstrated in Table 3, the RAM size grows exponentially with the bit size of the moduli set, which
is convenient in terms of the attainable frequency response. With the parallel modulo FIR implementation,
this table displays the FPGA synthesis results in terms of the number of LEs and delay measures for different
moduli sets. The maximum frequency of the filter for the 8-bit word length is 57.3 MHz, while the
maximum frequency of the filter for the 12-bit and 16-bit word lengths is limited by the accumulation of
reverse conversion results. The state-of-the-art comparison of the proposed FIR design with other FPGA RNS
FIR methods, for a three-moduli set with 32-bit unsigned input on a Xilinx Spartan 3E FPGA device, shows that
the proposed FSM decomposed RNS FIR design reduces power dissipation by more than 50% (from 175 mW to
73.49 mW) and increases the frequency by 36.56% [30], [31].
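
The exponential growth of RAM size with moduli bit width can be made concrete with a small count of
table entries. The sketch assumes a hypothetical monolithic table addressed by the concatenated binary
residues and compares it with one table per channel. Neither layout is taken from the synthesis
results in this paper, and the function names are illustrative.

```python
def residue_bits(m):
    """Bits needed to hold a residue in [0, m-1]."""
    return (m - 1).bit_length()


def monolithic_entries(moduli):
    """Single table addressed by all residues concatenated."""
    return 2 ** sum(residue_bits(m) for m in moduli)


def decomposed_entries(moduli):
    """One table per channel, each addressed by its own residue."""
    return sum(2 ** residue_bits(m) for m in moduli)


for moduli in ((7, 8, 9), (31, 32, 33)):
    print(moduli, monolithic_entries(moduli), decomposed_entries(moduli))
```

Under these assumptions the monolithic table grows from 1024 to 65536 entries when moving from
(7, 8, 9) to (31, 32, 33), while the per-channel tables grow from 32 to 128 entries in total.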

Table 3. State-of-the-art comparison of the proposed FIR design with other FPGA RNS FIR models

Method                                RNS model          Input/coefficient   FPGA device         Speed     Power
                                      used               size                used                (MHz)     dissipation (mW)
Core functional decimal equivalent    Five moduli set    32-bit unsigned     Xilinx Spartan 3E   230       175
binary conversion RNS FIR design [29]
Proposed FSM decomposed               Three moduli set   32-bit unsigned     Xilinx Spartan 3E   362.58    73.49
RNS FIR design


4.3. Critical path retention performance measure 

Figure 3 compares the performance trade-off of the conventional reverse computation of RNS with that
of the proposed model. The logic element counts were compared as the number of filter taps is extended,
and the total performance loss is smaller for the proposed model at higher filter orders.
For the 4-tap and 16-tap filters, the proposed design uses 2009 and 8193 logic elements,
compared with 2096 and 8623 for the conventional design, respectively. Figure 4 shows
that during filter convolution, path delay management in the FIR MAC network is achieved using
LUT-driven RNS networks, which operate as delay optimization models. The total time necessary to compute the
convolution output is reduced in this fashion. It is worth noting that the filter length must be sufficient to
retain the majority of the finite filter coefficients. As a result, implementing high-order filters incurs a small
computational trade-off for the proposed model when compared with the conventional reverse computation.
For 4-tap, 8-tap, and 16-tap filters, 2096, 4281, and 8623 logic elements were used, respectively.


[Figure 3 plot: area (LEs) for the conventional and proposed methods, series for 4 tap and 16 tap]
[Figure 4 plot: area (LEs) versus filter length for 4 tap, 8 tap, and 16 tap]

Figure 3. Performance trade-off comparisons over FIR filter length
Figure 4. Complexity trade-off comparison over FIR filter length for the proposed model


4.4. Comparison with other state-of-the-art RNS FIR models

Owing to the concurrent FSM-based LUT transformation, the proposed RNS system achieves a
significant path delay optimization margin over known RNS FIR designs, and the post-accumulation-driven
reverse conversion addresses power management issues and mitigates energy-related
problems in the RNS FIR system. Compared with the conventional RNS FIR model [29], the proposed RNS
consumes fewer logic resources with the least path delay, owing to the simplified reverse conversion
operations. Moreover, the dynamic range of the RNS, which depends largely on the moduli sizes, can be
extended without a performance trade-off. Table 4 shows the performance metrics of the proposed FSM
decomposed RNS, with improved system performance and energy efficiency. Compared with
the conventional RNS FIR design, the proposed FSM-based decomposed RNS FIR reduces the logic elements
(LEs) by 4.57%, increases the maximum frequency by 31.79%, reduces the number of LUTs
by 42.85%, reduces the number of transitions by 5.15%, and reduces the power dissipation by 13.83%.


Table 4. Performance analysis of the proposed FIR design with Quartus II hardware synthesis

Multiplier model               Hardware used   Fmax     Number of    Number of transitions   Power dissipation
                               (LEs)           (MHz)    LUTs/size    (millions/sec)          (mW)
Conventional RNS FIR [29]      3016            46.02    3/7          72.521                  200.17
FSM based decomposed RNS FIR   2878            67.47    3/4          68.779                  172.48
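
The percentages quoted for Table 4 can be reproduced from the tabulated values with the short check
below. The function names are hypothetical. Recomputing the figures suggests that the reductions are
taken relative to the conventional value, whereas the frequency gain matches the reported 31.79% only
when normalized by the proposed Fmax, so that normalization is assumed here.

```python
def reduction_pct(conventional, proposed):
    """Percentage saved relative to the conventional value."""
    return 100.0 * (conventional - proposed) / conventional


def gain_pct_vs_proposed(conventional, proposed):
    """Percentage gain normalized by the proposed value (assumed convention)."""
    return 100.0 * (proposed - conventional) / proposed


table4 = {
    "logic elements": reduction_pct(3016, 2878),
    "Fmax": gain_pct_vs_proposed(46.02, 67.47),
    "LUT entry (3/7 vs 3/4)": reduction_pct(7, 4),
    "transitions": reduction_pct(72.521, 68.779),
    "power": reduction_pct(200.17, 172.48),
}

for name, value in table4.items():
    print(f"{name}: {value:.2f}%")
```

The same two formulas also reproduce the area and frequency percentages quoted for Tables 1 and 2
to within rounding in the second decimal place.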


5. CONCLUSION

In this paper, a memory-efficient reverse conversion-based RNS system and an associated FSM-based
architectural optimization are presented to reduce the energy consumption of RNS-based FIR filters
for EEG signal classification. Design optimization is carried out only in the reverse
conversion stage, while the other processing units are kept generic for parametric variations within the RNS
system. The decomposed LUT-based reverse conversion and FSM ordering techniques offer a significant
reduction in hardware complexity and considerable energy efficiency compared with a direct single
compound memory-based reverse conversion realization. As stated earlier, this alternative form of RNS FIR
filter structure shows the smallest performance trade-off for higher-order FIR filters. The area is
decreased by 4.5% and the frequency is improved by 24.08% for the 8-bit word length, and the area is decreased
by 15.36% and the frequency is improved by 16.18% for the 16-bit word length. For the 4-tap FIR filter, the area
of the proposed method is decreased by 4.15% and the frequency is increased by 13.30%, whereas for the
16-tap FIR filter the area is decreased by 4.98% and the frequency is increased by 19.57% when compared with the
conventional reverse computation method. Compared with the conventional RNS FIR design, the
proposed FSM-based decomposed RNS FIR reduces the logic elements (LEs) by 4.57%, increases the maximum
frequency by 31.79%, reduces the number of LUTs by 42.85%, reduces the number of transitions
by 5.15%, and reduces the power dissipation by 13.83%.